Recognition Using Word Collocation

Authors

  • Tao Hong
  • Jonathan J. Hull
Abstract

A relaxation-based algorithm is proposed that improves the performance of a text recognition technique by propagating the influence of word collocation statistics. Word collocation refers to the likelihood that two words co-occur within a fixed distance of one another. For example, in a story about water transportation, it is highly likely that the word "river" will occur within ten words on either side of the word "boat." The proposed algorithm receives groups of visually similar decisions (called neighborhoods) for words in a running text; these are computed by a word recognition algorithm. The positions of decisions within the neighborhoods are modified based on how often they co-occur with decisions in the neighborhoods of other nearby words. This process is iterated a number of times, effectively propagating the influence of the collocation statistics across an input text. This improves on a strictly local analysis by allowing strong collocations to reinforce weak (but related) collocations elsewhere. An experimental analysis is discussed in which the algorithm is applied to improving text recognition results that are less than 60 percent correct. The correct rate is effectively improved to 90 percent or better in all cases.
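The iterative re-ranking described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no update formula, so the multiplicative update below, the function names (`relax`, `collocation`, `normalize`), the scores, and the toy collocation table are all assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

Neighborhood = Dict[str, float]            # candidate word -> current score
CollocTable = Dict[Tuple[str, str], float]  # word pair -> collocation strength


def collocation(table: CollocTable, a: str, b: str) -> float:
    """Symmetric lookup of a (hypothetical) collocation strength."""
    return table.get((a, b), table.get((b, a), 0.0))


def normalize(hood: Neighborhood) -> Neighborhood:
    """Rescale scores so that they sum to 1 (left unchanged if all are zero)."""
    total = sum(hood.values())
    return {w: s / total for w, s in hood.items()} if total > 0 else dict(hood)


def relax(neighborhoods: List[Neighborhood], table: CollocTable,
          window: int = 10, iterations: int = 5) -> List[List[str]]:
    """Re-rank each neighborhood using support from nearby neighborhoods.

    Assumed update rule: a candidate's score is multiplied by
    1 + (sum of collocation strength * score over candidates of every
    other word position within `window`). Scores are renormalized per pass.
    """
    current = [normalize(h) for h in neighborhoods]
    for _ in range(iterations):
        updated = []
        for i, hood in enumerate(current):
            lo, hi = max(0, i - window), min(len(current), i + window + 1)
            new_scores = {}
            for word, score in hood.items():
                support = sum(
                    collocation(table, word, other) * other_score
                    for j in range(lo, hi) if j != i
                    for other, other_score in current[j].items()
                )
                new_scores[word] = score * (1.0 + support)
            updated.append(normalize(new_scores))
        current = updated  # all positions are updated from the previous pass
    # Return candidates ordered from most to least supported.
    return [sorted(h, key=h.get, reverse=True) for h in current]


# Toy input: three word positions, each with visually similar candidates.
hoods = [
    {"boot": 0.40, "boat": 0.35, "beat": 0.25},
    {"rival": 0.60, "river": 0.40},
    {"bunk": 0.45, "bank": 0.40, "bark": 0.15},
]
# Hypothetical statistics: one strong and one weak collocation.
table = {("boat", "river"): 1.0, ("river", "bank"): 0.3}

ranked = relax(hoods, table)
print([r[0] for r in ranked])
```

In this toy input a single pass promotes only "boat" to the top of its neighborhood; later passes carry that gain over to "river" and then to "bank", which is the propagation effect the abstract describes.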


Similar resources

Degraded text recognition using word collocation

A relaxation-based algorithm is proposed that improves the performance of a text recognition technique by propagating the influence of word collocation statistics. Word collocation refers to the likelihood that two words co-occur within a fixed distance of one another. For example, in a story about water transportation, it is highly likely that the word "river" will occur within ten words on eithe...


Word Segmentation for Urdu OCR System

This paper presents a technique for Word segmentation for the Urdu OCR system. Word segmentation or word tokenization is a preliminary task for understanding the meanings of sentences in Urdu language processing. Several techniques are available for word segmentation in other languages but not much work has been done for word segmentation of Urdu Optical Character Recognition (OCR) System. A me...


Degraded Text Recognition Using Word Collocation and Visual Inter-Word Constraints

Given a noisy text page, a word recognizer can generate a set of candidates for each word image. A relaxation algorithm was proposed previously by the authors that uses word collocation statistics to select the candidate for each word that has the highest probability of being the correct decision. Because word collocation is a local constraint and collocation data trained from corpora are usual...


Recognition of word collocation habits using frequency rank ratio and inter-term intimacy

DOI: http://dx.doi.org/10.1016/j.eswa.2013.01.003. An effective algorithm for extracting two useful features from text documents for analyzing word collocation habits, "Frequency Rank Ratio" (FRR) and "Intimacy", is proposed. FRR is deriv...


Semantics-Driven Recognition of Collocations Using Word Embeddings

L2 learners often produce "ungrammatical" word combinations such as *give a suggestion or *make a walk. This is because of the "collocationality" of one of their items (the base), which limits the acceptance of collocates to express a specific meaning ('perform' above). We propose an algorithm that delivers, for a given base and the intended meaning of a collocate, the actual collocate lex...




Publication date: 2012